Skip to content

fix(neo4j): self-healing cross-session edge writes + unified :Node identity (v4.0.2)#21

Merged
colombod merged 1 commit into
mainfrom
fix/cross-session-edge-self-heal-node-identity
Jun 19, 2026
Merged

fix(neo4j): self-healing cross-session edge writes + unified :Node identity (v4.0.2)#21
colombod merged 1 commit into
mainfrom
fix/cross-session-edge-self-heal-node-identity

Conversation

@colombod

Copy link
Copy Markdown
Collaborator

Summary

Cross-session graph edges were silently dropped when an endpoint node was not yet committed at flush time. This makes the chunked-flush edge writer self-healing and unifies node identity on the universal :Node label so the fix holds across the whole ingest pipeline. Bumps the server to v4.0.2.

Root cause

_edge_merge_cypher built every cross-session edge with two MATCH clauses:

UNWIND $rows AS row
MATCH (src:Node {node_id: row.src_id, workspace: $workspace})
MATCH (dst:Node {node_id: row.dst_id, workspace: $workspace})
MERGE (src)-[r:TYPE]->(dst)

The two MATCH clauses are an inner join: if either endpoint was not yet committed, the row produced no bindings, the MERGE never ran, and the edge was dropped with no error, no log, no warning. This affects every cross-layer/cross-session edge the pipeline writes (SOURCED_FROM ×27, plus HAS_PART, TRIGGERED, HAS_STEP, and ~20 other edge types), all of which legitimately reference endpoints that may not exist at the writing flush.

Fix

  1. _edge_merge_cypher: MATCHMERGE both endpoints, then MERGE the relationship. An absent endpoint becomes a :Node placeholder rather than a dropped edge.
  2. Session node writer re-keyed: MERGE (n:Session {…})MERGE (n:Node {…}) SET n:Session. Identity keys on the universal :Node label so a placeholder converges with the later typed write instead of forking into two nodes; the :Session label is still applied.
  3. Schema (ensure_neo4j_schema): add a :Node(node_id, workspace) uniqueness constraint as the atomicity guard for concurrent MERGEs (the role the :Session constraint played for the Session MERGE); deduplicate (node_id, workspace) globally; run the universal-:Node backfill (previously unreachable behind an early return); drop the legacy plain idx_node_universal index (a uniqueness constraint carries its own backing index and cannot coexist with a standalone index on the same key); log backfill completion.
  4. Sync the mocked schema/MERGE unit tests to the :Node-keyed identity.

Migration

The universal :Node label was introduced in #19, but its backfill shipped unreachable, so the graph holds nodes written before #19 that lack the label. ensure_neo4j_schema migrates idempotently — global dedup → backfill → verify → drop legacy index → :Node constraint — before any write. On the large production graph, run it as a verified two-phase deploy. Full runbook, verification queries, and rollback: docs/node-identity-migration.md.

Evidence

  • tests/neo4j (real Neo4j): 69 passed — includes a RED→GREEN proof of the silent drop (tests/neo4j/test_silent_edge_drop.py) and a dirty-graph migration test (tests/neo4j/test_node_identity_migration.py).
  • Non-Neo4j unit suite: 1373 passed, 2 skipped.
  • ruff clean on all changed files.

Reviewer checklist

  • ensure_neo4j_schema order: dedup → backfill → verify → drop legacy index → :Node constraint, before any write (_flush_body awaits _ensure_schema first).
  • Session writer MERGEs identity on :Node and SETs :Session.
  • Edge writer MERGEs both endpoints.
  • Concurrent-MERGE atomicity comes from the :Node uniqueness constraint.

…entity (v4.0.2)

The chunked-flush edge writer built every cross-session edge with two MATCH
clauses, silently dropping edges when endpoints were not yet committed. Fix:
MERGE both endpoints to create :Node placeholders, re-key node identity on the
universal :Node label so placeholders converge with later typed writes, add a
:Node(node_id, workspace) uniqueness constraint as the atomicity guard for
concurrent MERGEs, make the previously-dead universal-:Node backfill reachable,
and drop the legacy plain index.

The :Node label was introduced in PR #19 but its backfill shipped unreachable,
leaving pre-#19 nodes untagged. The migration at startup (dedup → backfill →
verify → drop legacy index → :Node constraint) safely brings the production
graph into the state B′ assumes, before any write. Full operator runbook in
docs/node-identity-migration.md.

Bumps server to v4.0.2.

Evidence:
- tests/neo4j: 69 passed (includes RED→GREEN silent-drop proof +
  dirty-graph migration test)
- Non-neo4j unit suite: 1373 passed, 2 skipped
- ruff: clean on all changed files

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>
@bkrabach

Copy link
Copy Markdown
Collaborator

Tested #21 against a clone of a real 1.4M-node graph before merge — green light from our side. 👍

We snapshotted a live context-intelligence graph (1,398,163 nodes / 1,937,529 rels) into a throwaway Neo4j and ran the migration there. Production was never touched.

Tests (independent run):

  • tests/neo4j/ -m neo4j69 passed
  • unit (-m "not neo4j and not integration") → 1326 passed
  • ruff clean

ensure_neo4j_schema on the real graph:

Scenario Result
Graph as-is (already :Node-tagged from #19) 9.2s · returned True · 1 stray duplicate removed · :Node constraint created · 0 remaining dup groups
:Node stripped from all 1.4M nodes, re-run (exercises the resurrected backfill) 9.8s · backfilled 1,398,162 nodes → 0 untagged · constraint created · 0 dups · 0 data loss

So: ~10s end-to-end at real scale, no OOM, no errors. The dead-code backfill now runs and completes, and the drop-legacy-index → create-:Node-constraint handoff lands cleanly (no window where the graph is left without a backing index).

One non-blocking suggestion: the Step-1 global dedup (MATCH (n) WITH ... collect(n) ... DETACH DELETE) is the only un-batched pass in the migration. It was fine here — our graph had exactly 1 duplicate, so the collect groups are trivial — but it materializes a group-map over all nodes in heap. We tested on a large (ergonomic) JVM heap, so we can't certify it's OOM-proof on a memory-constrained Neo4j. Worth wrapping in CALL { ... } IN TRANSACTIONS like the backfill right below it — cheap insurance against the exact unbounded-transaction shape that motivated the chunking work originally.

Net: safe to merge from a real-data-scale standpoint. The self-healing edge writer + backfill resurrection both check out. Nice work.

@colombod

Copy link
Copy Markdown
Collaborator Author

Thanks for running it against a real 1.4M-node clone, @bkrabach — that's the validation that matters here.

On the Step-1 dedup batching suggestion: I tried it two ways and both made things worse on a memory-constrained Neo4j, so I'm keeping the un-batched form. Evidence:

  • Wrapping the DETACH DELETE in CALL { … } IN TRANSACTIONS — both the original collect(n) form and a lighter scalar-count(*) driving form — trips the 2 MiB db.memory.transaction.max integration test (tests/neo4j/test_orphan_visibility.py::test_finalization_orphan_surfaces_on_status) with MemoryPoolOutOfMemoryError.
  • Root cause: CALL { … } IN TRANSACTIONS eagerly materializes its driving rows before the first inner batch, and the dedup needs a grouping aggregation to find duplicate (node_id, workspace) keys — so peak memory goes up on a constrained heap, not down. The :Node backfill can use IN TRANSACTIONS safely only because its driving query is a plain streaming MATCH (n) WHERE NOT n:Node with no aggregation buffer.
  • The un-batched dedup passes that exact 2 MiB-capped test (and the full tests/neo4j suite, 69 passed) in addition to your 1.4M-node run — i.e. it's the form that actually survives the memory-constrained scenario the suggestion was guarding against.

If we ever need hard bounding for a pathologically duplicated graph, the right shape is a streaming per-key delete that avoids any full-graph grouping — but that wants the :Node index in place first, so it's a post-migration follow-up rather than something to fold into this pass. For the data we have (your clone had a single duplicate group), the un-batched dedup is correct and bounded enough. Leaving it as-is.

@colombod colombod merged commit 550c127 into main Jun 19, 2026
3 checks passed
@colombod colombod deleted the fix/cross-session-edge-self-heal-node-identity branch June 19, 2026 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants